Conversation
…so computes the 0.95 confidence interval around the mean BLEU score.
|
Thanks for this! I'd like to simply modify the one-line output for machine readability, i.e., `Mean BLEU score: 30.99 +/- 0.20`. Any objections? |
|
Perfectly fine, go ahead :)
…On Tue, Nov 7, 2017, 6:16 AM Matt Post ***@***.***> wrote:
Thanks for this! I'd like to simply modify the one-line output for
computer-readability, i.e.,
```
cat newstest2017.uedin-nmt.4955.cs-en | ./sacrebleu.py -t wmt17 -l cs-en -b 10
Mean BLEU score: 30.99 +/- 0.20
BLEU+case.mixed+lang.cs-en+numrefs.1+test.wmt17+tok.13a+version.1.0.4 = 30.99 +/- 0.20 n=10 62.4/36.9/24.4/16.4 (BP = 1.000 ratio = 1.004 hyp_len = 61946 ref_len = 61718)
```
Any objections?
|
|
So, don't hate me, but I had to refactor the main class to make an API. Do you want to try to rebase off master? If you don't get to it today I'll do that next. Then I'll push this out as version 1.1.0 and can hopefully leave it alone for a while... (This should actually be easier to implement with `compute_bleu()` now factored out.) |
|
Another issue: have you tested this against the Moses implementation to ensure the results are the same? |
|
I'll look into this shortly. |
|
Hello, There's this |
|
As for the numpy dependency: despite the name sacreBLEU, it would be nice to also add a character-based metric, e.g. chrF3, and there is a numpy implementation (probably much faster than the original pure-Python one): https://github.com/awslabs/sockeye/pull/216/files |
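For context, chrF3 is a character n-gram F-score in which recall is weighted three times as heavily as precision (beta = 3). A minimal pure-Python sketch of the idea (this is an illustration, not the sockeye numpy implementation linked above; the function names here are made up, and real chrF averages precision/recall per n-gram order rather than pooling all orders as done here):

```python
from collections import Counter


def char_ngrams(text, max_order=6):
    """Collect character n-gram counts for orders 1..max_order.

    Whitespace is removed first, following the common chrF convention.
    """
    chars = text.replace(" ", "")
    counts = Counter()
    for order in range(1, max_order + 1):
        for i in range(len(chars) - order + 1):
            counts[chars[i:i + order]] += 1
    return counts


def chrf(hypothesis, reference, max_order=6, beta=3.0):
    """Simplified sentence-level chrF: pooled character n-gram precision
    and recall combined into an F-beta score (beta=3 favors recall)."""
    hyp_counts = char_ngrams(hypothesis, max_order)
    ref_counts = char_ngrams(reference, max_order)
    overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
    hyp_total = sum(hyp_counts.values())
    ref_total = sum(ref_counts.values())
    if hyp_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / hyp_total
    recall = overlap / ref_total
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

A numpy version mainly vectorizes the counting and matching steps, which is where the linked implementation gets its speed.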
|
I'm going to close this now in light of its age. Please feel free to re-open it if you have the time and inclination! |
|
Want to pick this up again, @cfedermann? |
This includes the SIGPIPE fix. Call with `--bootstrap-trials $n` or `-b $n`. Any `n > 1` will result in bootstrap resampling to determine the BLEU score. If `numpy` is available, the code also computes the 0.95 confidence interval around the final BLEU score. Uses a fixed random seed `12345` to guarantee reproducible scores. This could later be made configurable, in which case the sacreBLEU signature would need to be updated. Not needed for now, though.
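The resampling loop described above can be sketched as follows. This is a simplified illustration, not the PR's actual code: `bootstrap_ci` and `score_fn` are hypothetical names, and while the PR uses numpy for the interval, this sketch sticks to the standard library so it is self-contained.

```python
import random
import statistics


def bootstrap_ci(hyp_ref_pairs, score_fn, n_trials=1000, seed=12345):
    """Bootstrap resampling: draw `n_trials` resamples of the test set
    (with replacement), score each one, and report the mean score plus
    a 0.95 normal-approximation confidence interval half-width.

    `score_fn` maps a list of (hypothesis, reference) pairs to a
    corpus-level score; the fixed seed makes the result reproducible.
    """
    rng = random.Random(seed)  # fixed seed 12345, as in the PR description
    n = len(hyp_ref_pairs)
    scores = []
    for _ in range(n_trials):
        resample = [hyp_ref_pairs[rng.randrange(n)] for _ in range(n)]
        scores.append(score_fn(resample))
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores)  # ~0.95 interval
    return mean, half_width
```

With a real corpus-level BLEU function plugged in as `score_fn`, the mean and half-width are what a line like `Mean BLEU score: 30.99 +/- 0.20` would report.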